Personalized prediction of incident hospitalization for cardiovascular disease in patients with hypertension using machine learning

Background Prognostic information for patients with hypertension is largely based on population averages. The purpose of this study was to compare the performance of four machine learning approaches for personalized prediction of incident hospitalization for cardiovascular disease among newly diagnosed hypertensive patients. Methods Using province-wide linked administrative health data in Alberta, we analyzed a cohort of 259,873 newly-diagnosed hypertensive patients from 2009 to 2015 who collectively had 11,863 incident hospitalizations for heart failure, myocardial infarction, and stroke. Linear multi-task logistic regression, neural multi-task logistic regression, random survival forest and Cox proportional hazard models were used to determine the number of event-free survivors at each time-point and to construct individual event-free survival probability curves. The predictive performance was evaluated by root mean squared error, mean absolute error, concordance index, and the Brier score. Results The random survival forest model has the lowest root mean squared error value at 33.94 and lowest mean absolute error value at 28.37. Machine learning methods provide similar discrimination and calibration in the personalized survival prediction of hospitalizations for cardiovascular events in patients with hypertension. Neural multi-task logistic regression model has the highest concordance index at 0.8149 and lowest Brier score at 0.0242 for the personalized survival prediction. Conclusions This is the first personalized survival prediction for cardiovascular diseases among hypertensive patients using administrative data. The four models tested in this analysis exhibited a similar discrimination and calibration ability in predicting personalized survival prediction of hypertension patients.

Page 2 of 11 Feng et al. BMC Medical Research Methodology (2022) 22:325 is then projected onto an individual [4]. There is little research focusing on individual-level prediction.
In this study, we consider several important machine learning (ML) methods that produce the entire survival probability curve for individual patients. Recently, research has reported the risk analysis and survival prediction for cancer patients by machine learning techniques as well as on different input features and data samples. Weng el at. report using machine learning to improve accuracy of cardiovascular risk [5]. Results found that machine learning method improves accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment, while avoiding unnecessary treatment. Bharath et al. also found similar results in that machine learning improve prediction accuracy in CVD prediction model in an initially asymptomatic population [6]. Although machine learning methods have shown encouraging success on predicting some medical conditions, it has not been applied to individually CVD survival prediction in patients with hypertension by using routinely collected large digital electronic administrative health data. If the large administrative data set can be exploited using machine learning algorithm, it may open the way to optimise the use of collected administrative data to assist in predicting patents' outcome, planning individualised patient care, monitoring resource utilization and improving institutional performance. Including comorbidity status, demographic data, lab test results and medication would improve assessment of prognosis and guide treatment decisions for hypertension patients.
Use of machine learning methods in clinical oncology has shown success [7,8], but this methodology has not been more broadly applied to other clinical areas. Addressing this, the purpose of this study was to compare evaluate four ML approaches for personalized prediction of incident hospitalization for heart failure (HF), myocardial infarction (MI), and stroke among newly diagnosed hypertensive patients using routinely collected administrative health data. To our knowledge, this the first study to develop and validate different state-of-theart ML models for individual CVD outcome prediction in hypertensive patients.

Data sources and study population
A retrospective cohort was assembled using linked administrative health databases from Alberta Health with information including demographic and vital statistics, physician billing claims, medication dispensations, hospital separation data, and laboratory services (Fig. 1). These data have been used in previous studies and shown to be high-quality and comprehensive [9,10].
The study population included all newly diagnosed cases of hypertension aged 18 to 99 years who were residents of Alberta. We identified hypertension cases using a validated case definition of two physician claims within two years or one hospitalization with hypertension related diagnosis codes (ICD-9-CM: 401.x, 402.x, 403.x, 404.x or 405.x; ICD-10-CA: I10.x, I11.x, I12.x, I13.x or I15.x) [11]. The first date of the hypertension diagnosis (index date) was assigned to patients for case definitions with more than one hypertension diagnosis. We   Fig. 2.

Predictors
Potential predictors were selected a priori based on previous studies and clinical reasoning [13]. Age was categorized into four age groups as the predictor in this study. Patients' demographic information, such as sex, region of residence was also used as predictors. The number of Charlson comorbidities present in each patient was categorized into "0, 1-2, or ≥ 3. " Fasting blood glucose, estimated glomerular filtration rate (eGFR), cholesterol levels and Glycated haemoglobin (A1c) (HbA1c) was determined between hypertension index date and outcome date [14,15]. Test results outside the standardized reference intervals were used (Blood glucose > = 7.0 mmol/L, eGFR < 60 mL/min/1.73m 2 , Cholesterol levels > 3.5 mmol/L, HbA1c > =6.5%) [16].
The following categories of antihypertensive medications have been shown to reduce cardiovascular risk and were identified using the anatomical therapeutic chemical (ATC) classification system: beta blockers (ATC codes in category C07, excluding C07AA07, C07AA12 and C07AG02); agents acting on the renin-angiotensin system (ATC codes in category C09); thiazide diuretics (ATC codes in category C03, excluding C03BA08 and C03CA01); calcium channel antagonists (ATC codes in category C08); and miscellaneous antihypertensives (ATC codes in category C02, excluding C02KX01) [17]. Respondents were categorized as using antihypertensive medication if an ATC code corresponded to the above list between the hypertension index date and outcome index date.

Statistical analysis
The study cohort was randomly divided into training (70% of total: n = 181,911) and validation (30% of total: n = 77,962) sets (Fig. 2). Multicollinearity between predictor variables was assessed using condition indices and variance proportions. Those with significant correlation were removed from the model. The linear multitask logistic regression (LMTLR) model is an alternative to the Cox's proportional hazard (CoxPH) model. It can be seen as a series of logistic regression models built on different time intervals to estimate the probability that the event of interest happened within each interval. The constructed LMTLR included 25 features and 50 intervals in this study. The neural multi-task logistic regression (NMTLR) allows the use of Neural Networks within the original multi-task logistic regression (MTLR) design. We used the same 25 features, 100 neurons in the first hidden layer and 100 neurons in the second hidden layer, and one output neuron before input to LMTLR. The random survival forest (RSF) is an extension of the Random Forest model that can take into account censoring individuals. We used 50 trees, the maximum depth of 5 and minimum number of samples required to be at a leaf node at 20 for the model development. The CoxPH is a semi-parametric model that focuses on modeling the hazard function, by assuming that its time component and feature component are proportional over time. The maximum number of iterations in the Newton optimization in this model was 600.

Model validation
The final survival prediction model was tested within the test dataset for those four models (LMTLR, NMTLR, RSF, CoxPH) [18]. The actual and predicted number of patients that experienced the CVD event at each time t was compared by computing the actual survival function of the validation data, which can be obtained using the Kaplan-Meier estimator and compare it to the average of all predicted survival functions [18]. Root mean squared error (RMSE) and mean absolute error (MAE) was used to provide the comparison as well as the performance metrics between the actual and predicted number of hypertensive patients experiencing a CVD event at each time, t. Model accuracy was assessed using discrimination (concordance index (C-index)) and calibration (Brier score). Analyses were conducted using SAS version 9.4 [19], R software version 3.5.1 [20] and Python version 3.7.6 [21]. Descriptive statistics were generated by SAS (Tables 1   and 2). The package 'survival' in R was used to produce Fig. 1 for survival analysis. 'PySurvival' in Python was used for ML model analyses. All the methods were performed in accordance with relevant guidelines and regulations.

Cohort characteristics
We identified 299,826 newly diagnosed hypertensive patients between April 1, 2009, and March 31, 2015. After applying exclusion criteria, there was a total of 259,873 patients with 899,393 person-years of follow-up and collectively with 11,863 events over a median followup time of 3.5 years (inter-quartile range 2.2 to 4.8 years). The incidence rate was 13.4 CVD hospitalizations per 1000 person-years. Among the study population 9182 (3.5%) patients died within the study period. The mortality rate during the follow-up period was 10.0 per 1000 person-years (95% CI: 9.8 to 10.2 per 1000 person-years).
The crude incidence of composite CVD hospitalization was 13.2 (95%CI: 13.0-13.4) per 1000 person-years. Hospitalization for MI was most common (6.1 (95%CI: 6.0-6.3) per 1000 events per person-years), followed by HF (5.6 (95%CI: 5.4-5.7) events per 1000 person-years), and lastly stroke (3.4 (95%CI: 3.3-3.5) events per 1000 person-years) ( Table 2, locate at the end of the document text file). The composite CVD hospitalization rate was higher for men, and this was driven by an excess risk of MI. Hospitalizations were most common in patients above the age of 75 years, those residing in rural locations, and individuals with at least two other Charlson   comorbidities both for the composite outcome and its individual components. Figure 3 shows Kaplan-Meier plots of the cumulative probability of being free of hospitalization for any CVD, HF, MI, and stroke as a function of survival time among newly diagnosed hypertension patients. MI had the lowest cumulative probability in the entire survival period when compared with HF and stroke. Table 3 shows a comparison of the actual and predicted number of hypertension patients experiencing CVD events at each time point for all four models. The RSF model had the lowest RMSE at 33.94 and the lowest MAE at 28.37, indicating the best fit. By plotting the individual's survival function over time, we compared the survival probability of developing CVD events among individual hypertensive patients. Table 4 shows the results when applying all four models in our hypertension cohort. Overall, all the models had a C-index > 0.70 and Brier score < 0.25 [22], representing a strong predictive ability in the validation set. Adding two layers of neural network before LMTLR, the NMTLR model had the highest C-index (0.82) and lowest Brier score (0.02); the best discrimination and calibration of all the models for event-free survival prediction. This suggests the NMTLR is most accurate and outperformed the other models in predicting CVD outcomes for incident hypertensive patients. Figure 4 visualizes the LMTLR, NMTLR, RSF and CoxPH models for two representative patients from the validation set. Patient one (red line) had a short eventfree survival of 4.0 years from diagnosis of hypertension until being hospitalized for CVD, while patient two (blue line) had a comparatively longer event-free survival of at least 4.9 years before being censored. Patient one developed CVD after 4.0 years diagnosed as hypertension, however, patient two may have been lost to follow-up or did not develop CVD at the end of the study or death until 4.6 years after diagnosed as a hypertensive patient. The median survival time (50% survival probability) as a point estimation for survival time predication was used in the study for personalized survival prediction performance evaluation. If the 50% survival probability is close to the survival time, the model has more accurate prediction performance. All four models performed well in predicting the prognosis for patient one whose 50% survival probability corresponded with the actual observed 4.0 years of event-free survival. However, only the NMTLR model provided accurate prediction of 50% survival probability for patient two who was lost survival information at 4.9 years in this study. For other models, take the LMTLR for example, in fig. 4(a), the survival probability for patient 2 at 4.9 years survival time is near 90%, and this patient's 50% survival probability is nearby 6.6 years. Although patient 2 passed the 50% survival probability after the 4.9 years in image (a), this patient's 50% survival probability does not close to the 4.9 survival years, which indicate the model could not well predict this patient's survival information in this model. The NMTLR was able to handle the presence of censoring better than the other models. Moreover, the individual survival curves for these two patients intercrossed at the beginning of the observation period. This may reflect the real situation that patient two has worse health condition or perhaps the patient is treated and controlled one year after being diagnosed with hypertension. Patient two have a pretty flat curve in the following period, however patient one became worse in the whole follow up period. The CoxPH model was unable to properly handle censoring, as represented by a horizontal survival probability line for patient two. Figure 5 shows the prediction results for two patients who had CVD outcome at 1.1 years and 2.3 years, respectively. Patient one had a hospitalization for CVD at 2.3 survival years while patient two was had hospitalization for CVD at 1.1 survival years. All of those four models can discriminate the two patients' survival  time versus survival probability. Patient two's survival curve was always lower than patient one and reached 50% survival probability faster than patient one. For NMTLR, the actual survival time corresponded closely to the estimated survival probability for both patients. For patients two, the survival curve was consistently lower than patient one, and the 50% survival probability occurred earlier for the first patient. Only NMTLR correctly predicted the survival time for patient one at 2.3 years and patient two at 1.1 years, based on the projected 50% survival probability. Moreover, NMTLR also had the smoothest survival curves with distinct shapes predicted for the two patients, while the CoxPH model predicted survival curves with similar shapes because of the proportional hazard assumption. For RSF model, we observed that the survival function was monotonically decreasing and parallel. This is likely due to both patients being in the same tree branch node in the model development process.

Discussion
In this study, we explored the performance of ML models on predicting incident hospitalization for CVD distribution in patients newly diagnosed with hypertension using administrative data. All models were developed and validated using the same training dataset. All the four models had high discrimination with C-index > 0.70 and good calibration with Brier score < 0.25. We showed how ML models can be applied to accurately predict the occurrence of hospitalization for CVD on both population and individual levels.
For population use, the ML models can predict the number of patients with events at each time point using survival functions, similar to traditional regression methods. The RSF model had the best performance for population-based prediction compared to the other three models. Moreover, the RMSE and MAE were quite small in the RSF model, indicating that the prediction results were relatively constant during the 6-year follow-up period. As we did not force the use of any particular variables, the RSF algorithm was allowed to include any variables available in the administrative dataset that were associated with risk of hospitalization for CVD, which likely made the model more accurate than other models based on linear regression (LMTLR, NMTLR and CoxPH) for the population-based prediction [23,24]. In terms of individual-level prediction, the NMTLR model had the highest C-index and lowest Brier score which means it had the best discrimination and calibration for individual survival prediction [25]. This may be because of the unique properties of using a neural network [26]. Neural networks require initialization and adjustment of many individual parameter to optimize the performance of the classification [27]. The NMTLR model that combines the neural network and multi-task logistic regression together was developed empirically and can be best fit for the training data in our study. NMTLR models the survival function by combining multiple local logistic regression modes in a dependent manner followed by a two layers neural networking procedure. This allows it to handle censored observations and the time-varying effects of features naturally to provide better results compared to the model which will only use fully observed (uncensored) instances (like CoxPH) [23]. The combination of neural network and multiple task logistic regression in NMTLR allows the model to build a nonlinear statistical data modeling tool to deal with complex relationships and has shown better predictive performance than the other three models [28].

Strengths and limitations
To our knowledge, this is the first study to develop and validate models for individual CVD outcomes among patients with hypertension using administrative data. Utilizing administrative health data provides the opportunity to: 1) utilize risk factors that are routinely collected; and2) adopt the methods into existing hypertension and cardiovascular care practice and programs that are relevant for precision medicine. Further, there is considerable potential to use this data to improve clinical care cross a spectrum of chronic diseases. Our study results support that large administrative data provides sufficient statistical power to develop and validate predictive algorithms with a larger set of risk factors and greater specification of those risks, which in turn generate distinct survival probability for a wide range of health profiles or populations. Importantly, for individual-level prediction, our finding suggests that NMTLR has the best discrimination and calibration performance when compared to the other three models.
There are limitations to this study. Firstly, most patients in the study were followed up for 3 years, which may not be adequate to capture all CVD outcomes, especially for those younger and have a small number of comorbidities. Secondly, this study was retrospective and conducted in a single cohort. Further study is required to demonstrate generalizability of our findings. Thirdly, there are many important factors, such as blood pressure and other CVD medications, that were not included due to the limitations of administrative health data used in this study. Lastly, this study did not fully take into account missing data. Variables were included in the model even if one patient had a single value in the chart. This may have some-what diminished our predictive accuracy; however, a strength of this approach is that it represents the true nature of administrative health data with minimal transformations and with no data imputations. Another consideration is that we elected to define hypertension patients using a validated diagnosis codes with a high sensitivity and specificity. This methodology that using 1 hospitalization and 2 physician claims algorithm for hypertensive patients' definition could represents a more easily deployable solution to cohort building and model development.

Conclusions
This study demonstrated that four ML models utilizing administrative health data exhibited similar high discrimination and calibration in predicting incident hospitalization for CVD among hypertensive patients. Specifically, the NMTLR model had the best individual survival prediction and the RSF model had the best population survival prediction. Improved prediction of outcome has the potential to help clinicians make more meaningful decisions about treatment. Importantly, this study makes use of administrative health data that is already routinely collected but underexploited by clinical health systems. While ML methodologies have many advantages, to truly improve patient care and outcomes, methods for teasing out causal relationships will remain an important part of the health care and biomedical armamentarium.